TECHNIQUE FOR IDENTIFICATION OF INFORMATION BASED ON PROTOCOL MARKERS



FIELD OF THE INVENTION 

The present invention relates to identification of information and, in particular, to 
the identification of time-variant multi-dimensional information in a distributed network
using a signature generated from protocol markers contained in the information. 

BACKGROUND INFORMATION 

In modern network environments, information may be stored in a plurality of re- 
mote locations using a wide variety of storage mechanisms. For example, information 
embodied as a file in a storage system's file system may be stored on a disk locally at- 
tached to a computer, on a storage system connected to a computer via a network at- 
tached storage (NAS) arrangement, or by a high-speed storage area network (SAN) con- 
figuration. In a network storage configuration, e.g., a NAS or SAN environment, various 
intermediate nodes may be present including, for example, routers, switches, network 
caching devices and file caching devices. Copies of the information persist for a period 
of time consistent with the intermediate node functionality. 

Applications executing on devices in a network often desire to compare or other- 
wise differentiate information that is stored in a plurality of remote locations. For exam- 
ple, an application executing on a networked computer may desire to know whether in- 
formation such as data stored locally on a disk is identical to data stored across a network 
on disks attached to a storage system. This may be the case when the information is
time-variant multi-dimensional information, such as a multi-media signal, that occupies a
large amount of space and requires a significant amount of bandwidth to transmit across a
network. As a further example, a network caching device may desire to know if data stored
in its local cache is identical to data currently being requested from a remotely located 
data store. If the data that is stored locally is identical to the data stored remotely, the 
network caching device may forgo issuing network-based data access
commands to obtain the requested data, thereby improving system performance speed 
and reducing network bandwidth loads by eliminating unnecessary data requests and 
transfers. 

This differentiation of the data may be accomplished by comparing the two data 
sources. Traditionally, data has been compared using a bit-by-bit comparison wherein 
each bit of the first data source is compared with its corresponding bit in the second data 
source. Any differences between the two data sources may be identified using this "brute 
force" technique. However, noted disadvantages of such a bit-by-bit comparison are the 
high level of computational power expended and the time required to perform such a 
comparison, especially on large data files. Another noted disadvantage is the possibility 
of needing to transmit the entire data file over a network to perform the bit-by-bit com- 
parison, thereby eliminating any potential gains in reducing bandwidth consumption. 

An alternate method for differentiating data is the use of a cross-correlation tech- 
nique, whereby a correlation procedure is performed between the two sets of data to de- 
termine if they contain the same content. Cross correlation techniques may be effective, 
when, for example, the data is stored in differing file systems that have differing headers 
(e.g., metadata) prepended to the actual data, for example, when one copy of the data is 
stored using the Microsoft Windows NT file system (NTFS) and a second copy of the 
data is stored using a UNIX-based file system, such as the original Unix File System 
format (UFS) or Berkeley Fast File System (FFS). However, the cross-correlation method
is also computationally intensive for large data files. Moreover, to improve the correla- 
tion results a larger amount of data needs to be transmitted, again reducing any potential 
bandwidth consumption savings. 

A third conventional method for comparing two data sets involves the use of 
metadata, which allows a user to quickly compare large volumes of data. Examples of 
metadata that may be used to compare data include file names, sizes, and/or dates of
creation. As noted above, various computer systems may implement metadata differently.
For example, some systems may have a limit on the number of characters that may
be in the file name or may not permit certain characters to be used in a file name. Addi- 
tionally, some systems may record a date of creation and a date when a file was last 
modified; however, other systems may only record the date a file was originally created 
and not later modification dates. 

However, a noted disadvantage of the metadata comparison method is the possi- 
bility of negative matches occurring even where identical content is present due to differ- 
ences in the associated metadata. Similarly, it is possible to obtain a positive match when 
the data content is not identical, but when the metadata indicates a match. Metadata in- 
formation is not uniformly implemented and/or deployed in heterogeneous networks such 
as the Internet. Thus, it is possible to encounter differences in the metadata associated 
with two data files even when the underlying data is identical. For example, assume two 
identical copies of a file with differing filenames. Decisions based on matching or non- 
matching of comparisons of metadata associated with the files may be incorrect and lead 
to erroneous conclusions. 

The typical techniques for comparing or identifying information have noted dis- 
advantages. As network environments grow larger and the use of remotely stored data 
expands, a significant amount of computational time and network resources may be 
wasted in identifying and comparing data across a network. 

SUMMARY OF THE INVENTION 

The disadvantages of the prior art are overcome by providing a technique for 
identification of information based upon protocol markers. According to the technique, a
signature that uniquely identifies the information is generated from a protocol used to
store, distribute and transport time-variant multi-dimensional information, such as
"real-world" signal and multi-media data. The signature comprises a set of protocol markers
that is unique to the protocol. Using the extracted signature, the system and method can 
differentiate amongst a plurality of data. Identification of the data is necessary to ensure 
uniqueness of that information and to compare various data in a distributed environment.

The storage, distribution or transportation of a real-world signal, such as an
audio-visual scene, onto a medium requires a transformation of that information via a
protocol.
This protocol transformation results in a representation of the information that is matched 
to the appropriate medium. In the case of time-variant multi-dimensional information
(content), the protocol is typically used to transform the information into a form suitable to
the medium. Such transformation may include a sampling stage, followed by one or 
more conversion stages, a quantization stage, and finally an entropy compression stage. 
Each type of transformation, illustratively represented by a different operation, results in
unique markers in the transformed content that enable a device implementing the novel
technique to identify and differentiate content resulting from transformations related to 
the specific protocol. 

By utilizing a priori knowledge of how the specific protocol is implemented, re- 
sidual protocol markers embedded in content are utilized to quickly and efficiently iden- 
tify the content. These protocol markers are by-products of the specific mathematical 
transformations performed in the course of encoding, e.g., a real-world signal to a me- 
dium via the specified protocol. Each specific protocol, e.g., MPEG-2, JPEG, etc., con- 
tains a unique set of protocol markers derived from the protocol. The use of the protocol 
markers to identify content eliminates the need for computationally expensive bit-by-bit 
comparisons or reliance on metadata implementations. By utilizing the known protocol 
markers, content may be quickly identified and/or compared to determine uniqueness. As 
protocols typically reduce or compress the representation of the underlying information, 
these protocols also typically provide unique markers that are condensed from the infor- 
mation, thereby requiring fewer computational resources to identify and differentiate. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The above and further advantages of the invention may be understood by referring 
to the following description in conjunction with the accompanying drawings in which 
like reference numerals indicate identical or functionally similar elements:

Fig. 1 is a schematic block diagram of an exemplary network environment in
accordance with an embodiment of the present invention;

Fig. 2 is a schematic block diagram of an exemplary protocol conversion flow in
accordance with an embodiment of the present invention; 

Fig. 3 is a schematic block diagram of an exemplary protocol conversion flow for 
MPEG-2 protocol in accordance with an embodiment of the present invention; 

Fig. 4 is a schematic block diagram of an exemplary content comparator in accor- 
dance with an embodiment of the present invention; 

Fig. 5 is a flow chart detailing the steps of a procedure for comparing content in 
accordance with an embodiment of the present invention; and 

Fig. 6 is a flow chart detailing the steps of a procedure performed by a network
caching device utilizing the teachings of the present invention.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

By way of further background, time-variant multi-dimensional (TVMD) information,
which may be further identified as "real-world" multi-sensory or, more generally,
multi-media signal information, typically contains protocol markers that
uniquely identify the information. The storage, distribution or transportation of such a
real-world signal, e.g., an audio/visual representation, onto a networking medium re- 
quires a transformation of the information via a defined protocol. Examples of such a 
defined protocol include the well-known Moving Picture Experts Group (MPEG), Joint
Photographic Experts Group (JPEG) and Graphics Interchange Format (GIF) protocol
specification formats. It should be noted that the teachings of the present invention are 
applicable to any protocol that includes or generates appropriate protocol markers, as de- 
scribed further below, in the transformed data. These protocol transformations result in a 
representation of the information that is matched to the desired medium. This represen- 
tation will be termed as "content" or "data content" herein. 

The transformation, via the defined protocol, may include various stages, including
a sampling stage, one or more conversion stages, a quantization stage and/or an
entropy compression stage. Each stage in the transformation "chain" is described by one or
more individually defined components of the protocol specification. Each protocol stage
typically reduces the information content of the original signal and, in the case of
non-expansive transformations, works to compress the information contained in the original
signal. The resulting compression chain enables easier transmission of the content via 
digital transmission media, such as local area networks (LANs) utilizing Ethernet-based 
cabling or other transport channel. Typically each stage also generates a protocol marker 
in the content that may be analyzed without reading the entire information content. Each 
specific protocol transforms content according to a well-known protocol specification. 
By inverting the transformations and traversing the transformation chain in reverse order, 
the original signal may be recovered, albeit with possible distortions introduced by the 
loss of information content. The novel system and method of the present invention
utilizes the transformed information's protocol markers to generate a signature that uniquely
identifies information without requiring a bit-by-bit comparison or relying on metadata
associated with the information. 

A. Exemplary Network Environment 

Fig. 1 is a schematic block diagram of an exemplary network environment 100 in 
which the principles of the present invention may be implemented. The network envi- 
ronment 100 is based around a network cloud 105, which may comprise point-to-point 
links, a wide area network (WAN), a virtual private network (VPN) implemented over a 
public network, a shared local area network (LAN) or any other acceptable networking 
architecture. Attached to the exemplary network cloud 105 is an intermediate node, such 
as a router 155 that may connect the network cloud 105 to other networks including, e.g., 
the well-known Internet 160. Additionally, a user system 110, e.g., a workstation or
personal computer (PC), is connected to the network 105 via a conventional network
interface controller (NIC) (not shown). Connected to the user system 110 is a storage
device 115 which, in the illustrative embodiment, comprises a disk.

Also connected to the network cloud 105 are a first storage system 125, which 
may comprise a file server or other storage appliance having a storage device 130, a sec- 
ond storage system 140 (and associated storage device 145) and a network or file-system 
caching device 170. The network or file-system caching device 170 stores recently
retrieved data so that it may forward its local copy of data to a requesting system instead of
forwarding a network data access request. It should be noted that the techniques of the
present invention may be practiced in many alternate networking configurations. As 
such, the exemplary network environment 100 of Fig. 1 should be taken as illustrative 
only. 

Often, it may be desirable to identify multi-media or multi-sensory data (or content)
when, for example, an application program (not shown) executing on the user system 110
desires access to a remote file 135 stored on storage device 130 of first storage
system 125 or a remote file 150 stored on storage device 145 of second storage
system 140. In such a situation, it would be desirable to know if the data stored locally, i.e.,
file 120, is identical to the remotely stored file 135 (or file 150). If the data content of the 
remote and local files is identical, network bandwidth is conserved by accessing the lo- 
cally stored file. Additionally, as accesses to locally attached disks are typically faster 
than access to network-attached disks, performance of the user system 110 is increased
by accessing local file 120 instead of remote file 135 (or file 150). 

As noted, one method of determining whether the content of the remote and local 
files is identical involves the use of a conventional data comparison technique. Here, the 
user system 110 performs a bit-by-bit comparison of local file 120 and, e.g., remote
file 135. However, in the course of performing the comparison, the entire remote file 
needs to be transferred over the network cloud 105 from the first storage system 125 to 
the user system 110, thereby negating any improvements to system performance or
savings of network bandwidth. Alternatively, the user system 110 may rely on metadata
associated with files 120 and 135. However, if, for example, the user system 110 implements
a file system that is different from a file system implemented by the first storage system 125, a
distinct possibility exists that the metadata may not generate a correct match or may
generate false positives or negatives.

In accordance with the present invention, however, the user may acquire a set of 
unique protocol markers from the file 120 and file 135. The markers are derived directly 
from the underlying information and are uniquely associated with this data. These proto- 
col markers are then compared to quickly determine whether file 120 is identical to 
file 135. Using the teachings of the present invention, the only information that needs to
be transmitted over the network cloud 105 from storage system 125 to the
user system 110 is the set of unique protocol markers, which is typically orders of magnitude
smaller than the complete file size. Note that the inventive technique applies similarly to 
network or file-system caching device 170. 

B. Protocol Marker Generation 

Protocol markers are generated as a byproduct of the conversion of TVMD con- 
tent into a form suitable for transmission over a transport medium in a computer network. 
The protocol markers are embedded in the resulting converted content and comprise re- 
siduals of various mathematical transformations performed on the content during conver- 
sion into an acceptable data format for transmission over the network. Each protocol 
generates a unique signature of protocol markers in accordance with the specific details 
of the protocol implementation. As such, the teachings of the present invention may be 
generalized to any protocol using the specific protocol's unique protocol markers. 

The process of converting the content into the appropriate format is typically de- 
fined by the specific protocol utilized. Broadly stated, protocol implementations utilize 
four basic stages: a sampling stage, one or more conversion stages, a quantization stage
and an entropy compression stage. Fig. 2 is a flow diagram of an exemplary protocol 
conversion flow 200 used to convert content into an appropriate format in accordance 
with a specific protocol. The raw data or TVMD content is initially inputted into a sam- 
pling stage 205 where the raw data is sampled (or digitized) for use in the remainder of 
the protocol implementation. The sampling stage 205 may vary the sampling rate per the 
specific protocol specification to achieve a desired quality and/or file size. After the 
sampling stage 205, the sampled data is fed into one or more conversion stages 210 that 
convert the sample data into an appropriate format for a later quantization stage 215. 
Such a conversion may include, e.g., splitting an image into component signals in
accordance with, e.g., the well-known RGB standard.

The quantization stage 215 typically divides the range of values obtained from the 
sampling stage into a series of non-overlapping, but not necessarily equal, sub-ranges. A 
discrete and unique value is then assigned to each sub-range, which reduces information 
content but achieves compression. The output of the quantization stage 215 is fed into an
entropy compression stage 220. Entropy compression refers generally to a group of
lossless compression techniques that may, for example, suppress repetitive sequences or 
utilize statistical encoding to reduce the size of the content embodied by the protocol. 
The output of the entropy compression stage and, thus, the protocol conversion flow 200, 
is the TVMD content encoded in the appropriate protocol. 
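
The conversion flow 200 can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not an implementation of any real protocol: the function names, the sub-range boundaries in `edges`, and the use of run-length encoding as a stand-in for the entropy compression stage 220 are all hypothetical, and the conversion stage 210 is reduced to the identity.

```python
import bisect


def sample(signal, count):
    """Sampling stage 205: evaluate a continuous signal at evenly spaced points."""
    return [signal(i / count) for i in range(count)]


def quantize(values, edges):
    """Quantization stage 215: map each value to the index of its sub-range.

    `edges` holds the sorted boundaries between non-overlapping sub-ranges,
    which need not be equally wide.
    """
    return [bisect.bisect_right(edges, value) for value in values]


def run_length_encode(symbols):
    """Entropy stage 220 stand-in: collapse repeats into [symbol, count] pairs."""
    runs = []
    for symbol in symbols:
        if runs and runs[-1][0] == symbol:
            runs[-1][1] += 1
        else:
            runs.append([symbol, 1])
    return runs


def convert(signal, count, edges):
    """Chain the stages; the conversion stage 210 is the identity in this toy."""
    return run_length_encode(quantize(sample(signal, count), edges))


# A slow ramp sampled 8 times and quantized into three unequal sub-ranges.
encoded = convert(lambda t: t, 8, edges=[0.25, 0.75])
```

Here the eight samples collapse to three runs, which shows how quantization discards information while making the subsequent lossless stage more effective.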

Fig. 3 is a schematic block diagram of an exemplary protocol conversion flow for 
the Moving Picture Experts Group part 2 (MPEG-2) video protocol. MPEG-2 protocol
encoding and decoding is further described in ISO/IEC-13818, the contents of which are 
hereby incorporated by reference. It should be noted that the principles of the present 
invention may be utilized for any protocol that generates known protocol markers and is 
not limited to graphics, image, video or audio encoding protocols. As such the descrip- 
tion of encoding using MPEG-2 should be taken as exemplary only. 

The source input 305, e.g., audio-visual scene signals or frames, is fed into a
frame reordering stage 310. The frame reordering stage 310 ensures that individual
frames of the video are in the proper order to be encoded depending on each individual
frame's type. For example, in the MPEG-2 standard, Intra pictures (I pictures) are coded
using only information present in the picture itself, Predicted pictures (P pictures) are
coded with respect to the nearest previous I or P picture, and Bi-directional pictures (B
pictures) use both a past and a future picture as a reference. Thus, for example, a B
picture must be encoded after all pictures that it relies upon have been encoded. The frame
reordering stage 310 associates the pictures into a proper order for encoding. 
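
The reordering rule can be sketched as follows. This is an illustrative Python sketch only: the function name `coding_order` and the single-letter picture types are assumptions, and the handling of trailing B pictures is simplified relative to a real encoder.

```python
def coding_order(display_types):
    """Return display indices in an order where every B picture follows
    the past and future reference pictures (I or P) that surround it."""
    order = []
    pending_b = []
    for index, kind in enumerate(display_types):
        if kind == "B":
            pending_b.append(index)   # must wait for its future reference
        else:
            order.append(index)       # I or P picture: encode immediately
            order.extend(pending_b)   # then the B pictures it unblocks
            pending_b = []
    order.extend(pending_b)           # trailing B pictures (simplified)
    return order


# Display order I B B P B B P becomes coding order I P B B P B B.
reordered = coding_order("IBBPBBP")
```

Each P picture is moved ahead of the B pictures that precede it in display order, so that both references of every B picture are available when it is encoded.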

The properly ordered frames are then forwarded to a motion estimation stage 315. 
In accordance with the MPEG-2 protocol, the motion estimation stage operates on
macro-blocks, which illustratively comprise 16x16 pixels within a frame. During the motion
estimation stage 315, a selected macro-block of a current frame is compared with all
16x16 regions of the reference frame, i.e., the frame being predicted from, e.g., a
previous I or P picture. The 16x16 region with the least mean-squared error relative to the
current macro-block is then selected, and a motion vector is encoded that specifies the
selected 16x16 region, together with an error value for each pixel in the macro-block.
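
The exhaustive search described above can be sketched in Python as follows. This is a hedged illustration, not the encoder of the MPEG-2 specification: the function names are hypothetical, frames are plain lists of pixel rows, and the block size is a parameter so that the example can run on a small frame.

```python
def mean_squared_error(frame, top, left, block):
    """Mean-squared error between `block` and the same-sized region of `frame`."""
    size = len(block)
    total = 0
    for r in range(size):
        for c in range(size):
            diff = frame[top + r][left + c] - block[r][c]
            total += diff * diff
    return total / (size * size)


def estimate_motion(reference, current, top, left, size=16):
    """Search every size x size region of `reference` for the best match to the
    macro-block of `current` at (top, left).

    Returns the motion vector (dy, dx) and the per-pixel error values.
    """
    block = [row[left:left + size] for row in current[top:top + size]]
    best = None
    for r in range(len(reference) - size + 1):
        for c in range(len(reference[0]) - size + 1):
            error = mean_squared_error(reference, r, c, block)
            if best is None or error < best[0]:
                best = (error, r, c)
    _, match_top, match_left = best
    residual = [
        [block[i][j] - reference[match_top + i][match_left + j] for j in range(size)]
        for i in range(size)
    ]
    return (match_top - top, match_left - left), residual


# Toy frames: `current` is `reference` moved down 1 row and right 2 columns.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
current = [
    [reference[r - 1][c - 2] if r >= 1 and c >= 2 else 0 for c in range(8)]
    for r in range(8)
]
vector, residual = estimate_motion(reference, current, top=2, left=3, size=2)
```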

The output of the motion estimation stage 315 is then fed into a discrete cosine
transform (DCT) function 320. The DCT 320 transforms 8x8 blocks of pixels from
a spatial domain to a frequency domain. More generally, the DCT 320 converts a block 
of pixels into a block of transformed coefficients, wherein the coefficients represent the 
spatial frequency components which make up the original block. After applying the 
DCT 320 the output is then fed to a quantization function (Q) 325. For a typical 8x8 
block, most of the DCT coefficients are almost zero ("near-zero"). Thus DCT coeffi- 
cients that are not close to zero are typically clustered around the DC value in the block. 
In the quantization step, the DCT coefficients are quantized so that the near-zero
coefficients are set to a zero value and the remaining coefficients are represented with
reduced precision. This is typically achieved by dividing each coefficient by a positive
integer, which results in a loss of information but improved compression. Quantization
may be achieved through use of a quantization table (QT) and dividing each element of 
the DCT results by the appropriate entry in the QT. Further compression is achieved by 
exploiting the statistical redundancies within quantized DCT coefficient data. The 8x8
block is then ordered via a well-known zig-zag pattern to create a large run of zeros. The 
non-zero coefficients, which are typically clustered near the beginning of the zig-zag or- 
dering, are encoded (at encoder stage 330) using a conventional variable length coding 
scheme. The large run of zeros, which is typically at the end of the ordering, is encoded 
using a run-length encoding, which typically transmits a count of the zeros in the run
rather than the zeros themselves. This further compresses the data. The output of the
encoding stage 330 is fed into a buffer 335 for later transmission as encoded output 
data 340. The buffer 335 may be utilized to ensure a constant bit rate flow from the out- 
put of the encoder to match any requisite data flow of the desired transmission medium. 
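
The DCT, quantization and zig-zag steps can be sketched in Python as follows. The sketch rests on assumptions: it uses an orthonormal DCT-II, a uniform quantization table whose step of 16 is a hypothetical value rather than a table from the MPEG-2 specification, and it merely counts the trailing run of zeros instead of producing variable length codes.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (8x8 in the text above)."""
    n = len(block)

    def scale(k):
        return math.sqrt((1 if k == 0 else 2) / n)

    return [
        [
            scale(u) * scale(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                for x in range(n)
                for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


def quantize(coefficients, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [
        [round(value / step) for value, step in zip(row, steps)]
        for row, steps in zip(coefficients, table)
    ]


def zigzag_order(n):
    """(row, column) pairs of an n x n block in zig-zag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(
        cells,
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def scan(block):
    """Zig-zag scan; return the leading coefficients and the trailing zero count."""
    sequence = [block[r][c] for r, c in zigzag_order(len(block))]
    end = len(sequence)
    while end > 0 and sequence[end - 1] == 0:
        end -= 1
    return sequence[:end], len(sequence) - end


flat_block = [[10] * 8 for _ in range(8)]     # a uniformly gray 8x8 block
uniform_table = [[16] * 8 for _ in range(8)]  # hypothetical quantization table
head, zero_run = scan(quantize(dct_2d(flat_block), uniform_table))
```

For the flat block, all of the energy lands in the first (DC) coefficient, so the scan yields one non-zero value followed by a run of 63 zeros.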

In the above example of the MPEG-2 protocol conversion flow, various protocol 
markers are generated during the various transformations. These protocol markers in- 
clude discrete cosine (DC) coefficients, motion vectors, and quantization results. By 
comparing the protocol markers generated by the encoding of content using the MPEG-2 
protocol, data may be quickly identified.

C. Content Matching By Protocol Markers

As noted above, the representation of "natural" real-world signals in a distributed
environment is defined by a protocol, i.e., is a process by which the natural signal's in- 
formation content is transformed and prepared for transfer over the distributed storage 
medium. Included within the transformed content is a set of protocol markers that has 
been computed from the original information. In accordance with the novel technique 
described herein, these protocol markers may be utilized as a signature to efficiently 
identify the original information. The markers, or "content signature", may then be used 
in a content-based decision process to store, distribute or transport the content in a dis- 
tributed networking environment. In one embodiment, the use of protocol markers re- 
duces the amount of resources required in the analysis of content between two given lo- 
cations in a distributed network. For example, if the content at point A is identical to 
content at point B, it is not necessary to transfer the content from point A to point B 
across a network. Only the markers are transferred for comparison. By utilizing the 
novel technique described herein, the required bandwidth necessary to transmit informa- 
tion used to determine the identification of the content at various points is significantly 
reduced. 

Fig. 4 is an exemplary flow diagram of a content comparator 400 adapted to com- 
pare two different sets of content to determine if they are identical in accordance with an 
illustrative embodiment of the present invention. Two content inputs (Input A and B) are 
received at a protocol identification module 405 and at respective data segmentation 
modules 410A and 410B. The protocol identification module 405 analyzes the received
inputs and uses various stored databases to identify the protocol utilized by the inputs. 
This may be accomplished by, for example, analyzing metadata of the received inputs or 
by comparing various blocks of their content with known protocol markers that are 
unique to a given protocol. The identification of a protocol of received content input is 
well-known to those skilled in the art and may be accomplished using a wide variety of 
techniques. 

The data segmentation modules 410A and 410B select various segments of the
received content input for comparison. These data segments may be selected according to
the identified protocol implementation to ensure that the analyzed segments contain
sufficient protocol markers to perform signature computation and analysis. For example, 
certain protocols may store metadata or other protocol markers in a header or footer of a 
file. When such protocols are utilized, the data segmentation module 410 selects the
appropriate data segments from the input and passes them to a signature computation
module 415. More generally, each data segmentation module 410 selects appropriate
segments from the entire content input for delivery to the appropriate signature
computation module 415A, B.

The signature computation module 415 uses the delivered content segments to 
generate a signature of the content. Illustratively, such a signature may be computed by 
analyzing the content and identifying appropriate protocol markers. In the example of a 
JPEG (Joint Photographic Experts Group) protocol, protocol markers could include discrete
cosine (DC) components, escape sequences, and/or a number of zeros. Similarly, in the
example of the MPEG (Moving Picture Experts Group) protocol, protocol markers include
those of the JPEG protocol and various motion vectors. The identified protocol markers 
comprising the content signature are then fed into a signature comparison module 420. 
The signature comparison module 420 compares the two generated signatures of the in- 
puts to determine if they are identical. It should be noted that the exemplary content 
comparator 400 may be implemented in hardware, software, firmware or a combination 
thereof in accordance with alternate embodiments of the present invention. More
generally, a content comparator 400 may comprise a plurality of protocol marker
identifiers, each comprising a protocol identification module 405, a data segmentation
module 410 and a signature computation module 415, associated with one or more signature
comparison modules 420.
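
The arrangement of modules in the content comparator 400 can be sketched in Python as follows. Everything here is hypothetical: the class and function names, the "TOY" protocol, and its marker rule (every fourth byte after a three-byte header) are stand-ins chosen for illustration, since extracting real JPEG or MPEG markers would require a full bitstream parser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Signature = Tuple[int, ...]


@dataclass
class MarkerIdentifier:
    """One protocol marker identifier: modules 405, 410 and 415 together."""
    identify: Callable[[bytes], Optional[str]]        # protocol identification 405
    segment: Dict[str, Callable[[bytes], bytes]]      # data segmentation 410
    compute: Dict[str, Callable[[bytes], Signature]]  # signature computation 415

    def signature(self, content):
        protocol = self.identify(content)
        if protocol is None or protocol not in self.compute:
            return None  # unknown protocol: no signature can be generated
        segments = self.segment[protocol](content)
        return protocol, self.compute[protocol](segments)


def signatures_match(first, second):
    """Signature comparison module 420."""
    return first is not None and first == second


# Hypothetical "TOY" protocol used only to exercise the sketch.
toy = MarkerIdentifier(
    identify=lambda content: "TOY" if content.startswith(b"TOY") else None,
    segment={"TOY": lambda content: content[3:]},
    compute={"TOY": lambda segments: tuple(segments[::4])},
)

input_a = b"TOY" + bytes(range(16))
input_b = b"TOY" + bytes(range(16))
input_c = b"TOY" + bytes(range(1, 17))

same = signatures_match(toy.signature(input_a), toy.signature(input_b))
different = signatures_match(toy.signature(input_a), toy.signature(input_c))
```

Only the short signatures are compared, which mirrors the point made above that the markers, rather than the full content, are what travel between locations. The toy marker rule is not collision-free; a practical rule would be derived from the protocol specification.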

An exemplary procedure 500 for comparing received content with local content is 
shown in the flowchart of Fig. 5. As used herein, the term "local content" refers to data 
content that is stored on locally attached disks and may be readily accessed without using 
networking commands. Note that procedure 500 may be implemented in a network to 
determine if, for example, information being written to a remote disk is identical to
information stored locally. If the information is identical, a local device/system may
decide not to expend the network bandwidth needed to write the information to the remote
disk. 

The procedure begins in step 505 and proceeds to step 510 where the
content is received at a local device via a write request directed over a network or by any 
other acceptable data transfer means. In step 515, the protocol used to encode the
received content is determined. As noted above, the protocol used to
encode content may be determined using a variety of techniques that are well-known to
those skilled in the art. 

After the protocol of the received content has been determined, in step 520 a de- 
termination is made whether the protocol is available for comparison. For example, it 
may be detected that the received content is encoded in the TIFF protocol; however, the 
hardware or software implementation of the system embodying the inventive procedure 
does not contain the TIFF protocol markers for use when computing a signature of the 
received content. This could occur when, for example, a new protocol is created, but be- 
fore appropriate protocol markers for signature generation are implemented in the system 
embodying the procedure. If the protocol is not available, the procedure exits in step 525 
without comparing the two data contents. 

However, if the protocol is available, then, in step 530, the procedure computes 
the signature of the received content using the appropriate protocol markers for the proto- 
col associated with that content. Next, in step 535, the computed signature of the re- 
ceived input content is compared with the local content signature and, in step 540 a de- 
termination is made as to whether a match has occurred. If there is no match between the 
received input content and the local content signatures, the procedure branches to 
step 545 and identifies the received content as being different from the local content. 
Otherwise the procedure continues to step 550 and identifies the received content as be- 
ing identical to the local content. The procedure then completes in step 555. 
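
The control flow of procedure 500 can be sketched in Python as follows. The function names, the string results and the "TOY" marker rule are assumptions made for illustration; the TIFF check relies only on the two standard TIFF byte-order headers, and TIFF is deliberately left without a marker extractor to exercise the exit in step 525.

```python
def identify(content):
    """Step 515: determine the protocol from leading bytes (illustrative only)."""
    if content.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    if content.startswith(b"TOY"):
        return "TOY"  # hypothetical protocol used for the demonstration
    return None


def compare_received(received, local_signature, extractors):
    """Steps 515 to 550 of procedure 500 as a single function."""
    protocol = identify(received)               # step 515
    if protocol not in extractors:              # step 520
        return "unavailable"                    # step 525: exit without comparing
    signature = extractors[protocol](received)  # step 530
    if signature != local_signature:            # steps 535 and 540
        return "different"                      # step 545
    return "identical"                          # step 550


# Hypothetical marker rule: every fourth byte after the 3-byte header.
extractors = {"TOY": lambda content: tuple(content[3::4])}

local_content = b"TOY" + bytes([7, 1, 2, 3, 9])
local_signature = extractors["TOY"](local_content)

result_same = compare_received(local_content, local_signature, extractors)
result_changed = compare_received(
    b"TOY" + bytes([8, 1, 2, 3, 9]), local_signature, extractors
)
result_tiff = compare_received(b"II*\x00payload", local_signature, extractors)
```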

D. Caching Using Protocol Markers 

In an illustrative embodiment, the techniques of the present invention may be
implemented by a network caching device. By analyzing received data content, a caching
device may quickly determine whether the data is already stored in its network cache. If
such data is already stored in its cache, the network caching device may terminate the 
transmission and utilize the stored copy of the content. The use of a stored, local copy 
may significantly improve system performance and reduce the amount of network band- 
width utilized. 

An exemplary procedure 600 for implementing the teachings of the present in- 
vention within a network caching environment is shown in the flowchart of Fig. 6. The 
procedure begins in step 605 and continues to step 610 where the network caching device 
receives new content. The new content may be received by the caching device by way of 
data transmission from another device in the network. In step 615, a determination is 
made as to the protocol of the new content using conventional protocol determination 
techniques. 
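
The description does not specify which conventional protocol determination technique is 
used in step 615. One common approach, shown here purely as an assumption, is to inspect 
the leading bytes of the content for a well-known format prefix. The table MAGIC and the 
function detect_protocol are hypothetical names introduced for this sketch. 

```python
from typing import Optional

# Well-known leading byte sequences for a few formats.
MAGIC = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),          # little-endian TIFF header
    (b"MM\x00*", "tiff"),          # big-endian TIFF header
    (b"\x89PNG\r\n\x1a\n", "png"),
)


def detect_protocol(content: bytes) -> Optional[str]:
    """Return a protocol name based on the content's leading bytes, or None."""
    for prefix, name in MAGIC:
        if content.startswith(prefix):
            return name
    return None
```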

Once the protocol has been determined, a determination is made as to whether the 
protocol is available in this particular network caching device (step 620). If the 
protocol is not available, the procedure branches to step 655 where a cache miss is 
generated and output. This may occur when, for example, appropriate protocol markers for 
the identified protocol have not been incorporated into the network caching device. 
Otherwise, the procedure continues to step 625 where the length of the new content is 
computed. This may be accomplished by conventional techniques used to identify the size 
of a data file. In step 630, the length of the content stored in the network cache is 
compared with the length of the new content to determine if there is a match. If not, 
the new content is not the same size as the stored content and the procedure branches to 
step 655 and outputs a cache miss. 

However, if there is a match, the procedure continues to step 635 where the 
signature of the new content is computed using known protocol markers associated with 
the identified protocol of the new content. Then, in step 640, the computed signature of 
the new content is compared with the signature of content stored in the network cache. 
If the two signatures do not match, the procedure branches to step 655 and outputs a 
cache miss. Otherwise, the procedure continues to step 645 where a cache hit is generated and 
output. The procedure then completes at step 650. In alternate embodiments, a network 
caching device only utilizes the generated signature of the new content in making a cache 
hit determination. 
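
The cache hit determination of procedure 600 may be sketched as follows. This is an 
illustrative sketch under stated assumptions, not the claimed implementation: the names 
CacheEntry, make_entry, cache_lookup, EXTRACTORS and toy_markers are hypothetical, the 
cache is assumed to be a simple list of entries, and the "toy" marker convention (each 
byte that follows a 0xFF byte) stands in for real protocol markers. The check_length 
flag models the alternate embodiment in which only the signature is used. 

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def toy_markers(data: bytes) -> bytes:
    """Hypothetical extractor: keep each byte that follows a 0xFF byte,
    ignoring 0xFF 0x00 and 0xFF 0xFF pairs."""
    return bytes(
        data[i + 1]
        for i in range(len(data) - 1)
        if data[i] == 0xFF and data[i + 1] not in (0x00, 0xFF)
    )


# Hypothetical registry of protocols available in this caching device.
EXTRACTORS: Dict[str, Callable[[bytes], bytes]] = {"toy": toy_markers}


@dataclass
class CacheEntry:
    length: int        # size of the stored content in bytes
    signature: bytes   # marker-based signature of the stored content


def make_entry(content: bytes) -> CacheEntry:
    """Build the cache record for stored content (assumes the 'toy' protocol)."""
    return CacheEntry(len(content), toy_markers(content))


def cache_lookup(cache: List[CacheEntry], protocol: str, new_content: bytes,
                 check_length: bool = True) -> bool:
    """Return True for a cache hit (step 645), False for a miss (step 655)."""
    extractor = EXTRACTORS.get(protocol)
    if extractor is None:                      # step 620: protocol unavailable
        return False
    candidates = cache
    if check_length:                           # steps 625 and 630
        candidates = [e for e in cache if e.length == len(new_content)]
        if not candidates:
            return False
    signature = extractor(new_content)         # step 635
    return any(e.signature == signature for e in candidates)  # step 640
```

Performing the length comparison first lets the device report a miss without extracting 
any markers when the sizes differ. 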

The concepts used in a cache device can be generalized to include storage 
resource management (SRM) techniques. For example, a host device may walk the files of a 
file system. The file walking system stores metadata associated with each file in a data 
structure and/or database. When protocol markers are included in the metadata, a more 
robust identification technique is available to identify repeated files. 
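
A file walk that records marker-based metadata may be sketched as follows. This is a 
minimal sketch under stated assumptions: the functions toy_markers, walk_metadata and 
repeated_files are hypothetical names, the metadata is held in an in-memory dictionary 
rather than a database, each file is read whole for simplicity, and the "toy" marker 
convention (each byte that follows a 0xFF byte) stands in for real protocol markers. 

```python
import os
from collections import defaultdict
from typing import Dict, List, Tuple

Metadata = Dict[str, Tuple[int, bytes]]  # path -> (length, marker signature)


def toy_markers(data: bytes) -> bytes:
    """Hypothetical extractor: keep each byte that follows a 0xFF byte,
    ignoring 0xFF 0x00 and 0xFF 0xFF pairs."""
    return bytes(
        data[i + 1]
        for i in range(len(data) - 1)
        if data[i] == 0xFF and data[i + 1] not in (0x00, 0xFF)
    )


def walk_metadata(root: str) -> Metadata:
    """Walk the tree under root and record length and signature for each file."""
    metadata: Metadata = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            metadata[path] = (len(data), toy_markers(data))
    return metadata


def repeated_files(metadata: Metadata) -> List[List[str]]:
    """Group paths whose length and marker signature both match."""
    groups = defaultdict(list)
    for path, key in metadata.items():
        groups[key].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Each returned group lists files that share both a length and a signature, which the 
sketch treats as repeated files. 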

To summarize, the present invention provides a technique for identification of in- 
formation based upon protocol markers. By using a priori knowledge of specific proto- 
col implementations, a set of protocol markers may be obtained from a specified file to 
generate a signature of the content. The signature may then be compared with signatures 
of other information to quickly differentiate and/or compare information content. Using 
the principles of the present invention, only the protocol markers comprising the signa- 
ture of the content need to be transmitted and compared to differentiate between two data 
contents. 

The foregoing description has been directed to specific embodiments of this 
invention. It will be apparent, however, that other variations and modifications may be 
made to the described embodiments, with the attainment of some or all of their 
advantages. Specifically, it should be noted that any protocol may be utilized with the 
teachings of the present invention provided that the protocol generates acceptable 
markers for use in creating a signature of content. Additionally, the procedures or 
processes described herein may be implemented in hardware, software, embodied as a 
computer-readable medium having program instructions, firmware, or a combination 
thereof. Therefore, it is the object of the appended claims to cover all such variations 
and modifications as come within the true spirit and scope of the invention. 

What is claimed is: 